Prosper Loan Data Exploration by Yeqing Zhang

## [1] 113937     81
##                    ListingKey     ListingNumber    
##  17A93590655669644DB4C06:     6   Min.   :      4  
##  349D3587495831350F0F648:     4   1st Qu.: 400919  
##  47C1359638497431975670B:     4   Median : 600554  
##  8474358854651984137201C:     4   Mean   : 627886  
##  DE8535960513435199406CE:     4   3rd Qu.: 892634  
##  04C13599434217079754AEE:     3   Max.   :1255725  
##  (Other)                :113912                    
##                     ListingCreationDate  CreditGrade         Term      
##  2013-10-02 17:20:16.550000000:     6          :84984   Min.   :12.00  
##  2013-08-28 20:31:41.107000000:     4   C      : 5649   1st Qu.:36.00  
##  2013-09-08 09:27:44.853000000:     4   D      : 5153   Median :36.00  
##  2013-12-06 05:43:13.830000000:     4   B      : 4389   Mean   :40.83  
##  2013-12-06 11:44:58.283000000:     4   AA     : 3509   3rd Qu.:36.00  
##  2013-08-21 07:25:22.360000000:     3   HR     : 3508   Max.   :60.00  
##  (Other)                      :113912   (Other): 6745                  
##                  LoanStatus                  ClosedDate   
##  Current              :56576                      :58848  
##  Completed            :38074   2014-03-04 00:00:00:  105  
##  Chargedoff           :11992   2014-02-19 00:00:00:  100  
##  Defaulted            : 5018   2014-02-11 00:00:00:   92  
##  Past Due (1-15 days) :  806   2012-10-30 00:00:00:   81  
##  Past Due (31-60 days):  363   2013-02-26 00:00:00:   78  
##  (Other)              : 1108   (Other)            :54633  
##   BorrowerAPR       BorrowerRate     LenderYield     
##  Min.   :0.00653   Min.   :0.0000   Min.   :-0.0100  
##  1st Qu.:0.15629   1st Qu.:0.1340   1st Qu.: 0.1242  
##  Median :0.20976   Median :0.1840   Median : 0.1730  
##  Mean   :0.21883   Mean   :0.1928   Mean   : 0.1827  
##  3rd Qu.:0.28381   3rd Qu.:0.2500   3rd Qu.: 0.2400  
##  Max.   :0.51229   Max.   :0.4975   Max.   : 0.4925  
##  NA's   :25                                          
##  EstimatedEffectiveYield EstimatedLoss   EstimatedReturn 
##  Min.   :-0.183          Min.   :0.005   Min.   :-0.183  
##  1st Qu.: 0.116          1st Qu.:0.042   1st Qu.: 0.074  
##  Median : 0.162          Median :0.072   Median : 0.092  
##  Mean   : 0.169          Mean   :0.080   Mean   : 0.096  
##  3rd Qu.: 0.224          3rd Qu.:0.112   3rd Qu.: 0.117  
##  Max.   : 0.320          Max.   :0.366   Max.   : 0.284  
##  NA's   :29084           NA's   :29084   NA's   :29084   
##  ProsperRating..numeric. ProsperRating..Alpha.  ProsperScore  
##  Min.   :1.000                  :29084         Min.   : 1.00  
##  1st Qu.:3.000           C      :18345         1st Qu.: 4.00  
##  Median :4.000           B      :15581         Median : 6.00  
##  Mean   :4.072           A      :14551         Mean   : 5.95  
##  3rd Qu.:5.000           D      :14274         3rd Qu.: 8.00  
##  Max.   :7.000           E      : 9795         Max.   :11.00  
##  NA's   :29084           (Other):12307         NA's   :29084  
##  ListingCategory..numeric. BorrowerState  
##  Min.   : 0.000            CA     :14717  
##  1st Qu.: 1.000            TX     : 6842  
##  Median : 1.000            NY     : 6729  
##  Mean   : 2.774            FL     : 6720  
##  3rd Qu.: 3.000            IL     : 5921  
##  Max.   :20.000                   : 5515  
##                            (Other):67493  
##                     Occupation         EmploymentStatus
##  Other                   :28617   Employed     :67322  
##  Professional            :13628   Full-time    :26355  
##  Computer Programmer     : 4478   Self-employed: 6134  
##  Executive               : 4311   Not available: 5347  
##  Teacher                 : 3759   Other        : 3806  
##  Administrative Assistant: 3688                : 2255  
##  (Other)                 :55456   (Other)      : 2718  
##  EmploymentStatusDuration IsBorrowerHomeowner CurrentlyInGroup
##  Min.   :  0.00           False:56459         False:101218    
##  1st Qu.: 26.00           True :57478         True : 12719    
##  Median : 67.00                                               
##  Mean   : 96.07                                               
##  3rd Qu.:137.00                                               
##  Max.   :755.00                                               
##  NA's   :7625                                                 
##                     GroupKey                 DateCreditPulled 
##                         :100596   2013-12-23 09:38:12:     6  
##  783C3371218786870A73D20:  1140   2013-11-21 09:09:41:     4  
##  3D4D3366260257624AB272D:   916   2013-12-06 05:43:16:     4  
##  6A3B336601725506917317E:   698   2014-01-14 20:17:49:     4  
##  FEF83377364176536637E50:   611   2014-02-09 12:14:41:     4  
##  C9643379247860156A00EC0:   342   2013-09-27 22:04:54:     3  
##  (Other)                :  9634   (Other)            :113912  
##  CreditScoreRangeLower CreditScoreRangeUpper
##  Min.   :  0.0         Min.   : 19.0        
##  1st Qu.:660.0         1st Qu.:679.0        
##  Median :680.0         Median :699.0        
##  Mean   :685.6         Mean   :704.6        
##  3rd Qu.:720.0         3rd Qu.:739.0        
##  Max.   :880.0         Max.   :899.0        
##  NA's   :591           NA's   :591          
##         FirstRecordedCreditLine CurrentCreditLines OpenCreditLines
##                     :   697     Min.   : 0.00      Min.   : 0.00  
##  1993-12-01 00:00:00:   185     1st Qu.: 7.00      1st Qu.: 6.00  
##  1994-11-01 00:00:00:   178     Median :10.00      Median : 9.00  
##  1995-11-01 00:00:00:   168     Mean   :10.32      Mean   : 9.26  
##  1990-04-01 00:00:00:   161     3rd Qu.:13.00      3rd Qu.:12.00  
##  1995-03-01 00:00:00:   159     Max.   :59.00      Max.   :54.00  
##  (Other)            :112389     NA's   :7604       NA's   :7604   
##  TotalCreditLinespast7years OpenRevolvingAccounts
##  Min.   :  2.00             Min.   : 0.00        
##  1st Qu.: 17.00             1st Qu.: 4.00        
##  Median : 25.00             Median : 6.00        
##  Mean   : 26.75             Mean   : 6.97        
##  3rd Qu.: 35.00             3rd Qu.: 9.00        
##  Max.   :136.00             Max.   :51.00        
##  NA's   :697                                     
##  OpenRevolvingMonthlyPayment InquiriesLast6Months TotalInquiries   
##  Min.   :    0.0             Min.   :  0.000      Min.   :  0.000  
##  1st Qu.:  114.0             1st Qu.:  0.000      1st Qu.:  2.000  
##  Median :  271.0             Median :  1.000      Median :  4.000  
##  Mean   :  398.3             Mean   :  1.435      Mean   :  5.584  
##  3rd Qu.:  525.0             3rd Qu.:  2.000      3rd Qu.:  7.000  
##  Max.   :14985.0             Max.   :105.000      Max.   :379.000  
##                              NA's   :697          NA's   :1159     
##  CurrentDelinquencies AmountDelinquent   DelinquenciesLast7Years
##  Min.   : 0.0000      Min.   :     0.0   Min.   : 0.000         
##  1st Qu.: 0.0000      1st Qu.:     0.0   1st Qu.: 0.000         
##  Median : 0.0000      Median :     0.0   Median : 0.000         
##  Mean   : 0.5921      Mean   :   984.5   Mean   : 4.155         
##  3rd Qu.: 0.0000      3rd Qu.:     0.0   3rd Qu.: 3.000         
##  Max.   :83.0000      Max.   :463881.0   Max.   :99.000         
##  NA's   :697          NA's   :7622       NA's   :990            
##  PublicRecordsLast10Years PublicRecordsLast12Months RevolvingCreditBalance
##  Min.   : 0.0000          Min.   : 0.000            Min.   :      0       
##  1st Qu.: 0.0000          1st Qu.: 0.000            1st Qu.:   3121       
##  Median : 0.0000          Median : 0.000            Median :   8549       
##  Mean   : 0.3126          Mean   : 0.015            Mean   :  17599       
##  3rd Qu.: 0.0000          3rd Qu.: 0.000            3rd Qu.:  19521       
##  Max.   :38.0000          Max.   :20.000            Max.   :1435667       
##  NA's   :697              NA's   :7604              NA's   :7604          
##  BankcardUtilization AvailableBankcardCredit  TotalTrades    
##  Min.   :0.000       Min.   :     0          Min.   :  0.00  
##  1st Qu.:0.310       1st Qu.:   880          1st Qu.: 15.00  
##  Median :0.600       Median :  4100          Median : 22.00  
##  Mean   :0.561       Mean   : 11210          Mean   : 23.23  
##  3rd Qu.:0.840       3rd Qu.: 13180          3rd Qu.: 30.00  
##  Max.   :5.950       Max.   :646285          Max.   :126.00  
##  NA's   :7604        NA's   :7544            NA's   :7544    
##  TradesNeverDelinquent..percentage. TradesOpenedLast6Months
##  Min.   :0.000                      Min.   : 0.000         
##  1st Qu.:0.820                      1st Qu.: 0.000         
##  Median :0.940                      Median : 0.000         
##  Mean   :0.886                      Mean   : 0.802         
##  3rd Qu.:1.000                      3rd Qu.: 1.000         
##  Max.   :1.000                      Max.   :20.000         
##  NA's   :7544                       NA's   :7544           
##  DebtToIncomeRatio         IncomeRange    IncomeVerifiable
##  Min.   : 0.000    $25,000-49,999:32192   False:  8669    
##  1st Qu.: 0.140    $50,000-74,999:31050   True :105268    
##  Median : 0.220    $100,000+     :17337                   
##  Mean   : 0.276    $75,000-99,999:16916                   
##  3rd Qu.: 0.320    Not displayed : 7741                   
##  Max.   :10.010    $1-24,999     : 7274                   
##  NA's   :8554      (Other)       : 1427                   
##  StatedMonthlyIncome                    LoanKey       TotalProsperLoans
##  Min.   :      0     CB1B37030986463208432A1:     6   Min.   :0.00     
##  1st Qu.:   3200     2DEE3698211017519D7333F:     4   1st Qu.:1.00     
##  Median :   4667     9F4B37043517554537C364C:     4   Median :1.00     
##  Mean   :   5608     D895370150591392337ED6D:     4   Mean   :1.42     
##  3rd Qu.:   6825     E6FB37073953690388BC56D:     4   3rd Qu.:2.00     
##  Max.   :1750003     0D8F37036734373301ED419:     3   Max.   :8.00     
##                      (Other)                :113912   NA's   :91852    
##  TotalProsperPaymentsBilled OnTimeProsperPayments
##  Min.   :  0.00             Min.   :  0.00       
##  1st Qu.:  9.00             1st Qu.:  9.00       
##  Median : 16.00             Median : 15.00       
##  Mean   : 22.93             Mean   : 22.27       
##  3rd Qu.: 33.00             3rd Qu.: 32.00       
##  Max.   :141.00             Max.   :141.00       
##  NA's   :91852              NA's   :91852        
##  ProsperPaymentsLessThanOneMonthLate ProsperPaymentsOneMonthPlusLate
##  Min.   : 0.00                       Min.   : 0.00                  
##  1st Qu.: 0.00                       1st Qu.: 0.00                  
##  Median : 0.00                       Median : 0.00                  
##  Mean   : 0.61                       Mean   : 0.05                  
##  3rd Qu.: 0.00                       3rd Qu.: 0.00                  
##  Max.   :42.00                       Max.   :21.00                  
##  NA's   :91852                       NA's   :91852                  
##  ProsperPrincipalBorrowed ProsperPrincipalOutstanding
##  Min.   :    0            Min.   :    0              
##  1st Qu.: 3500            1st Qu.:    0              
##  Median : 6000            Median : 1627              
##  Mean   : 8472            Mean   : 2930              
##  3rd Qu.:11000            3rd Qu.: 4127              
##  Max.   :72499            Max.   :23451              
##  NA's   :91852            NA's   :91852              
##  ScorexChangeAtTimeOfListing LoanCurrentDaysDelinquent
##  Min.   :-209.00             Min.   :   0.0           
##  1st Qu.: -35.00             1st Qu.:   0.0           
##  Median :  -3.00             Median :   0.0           
##  Mean   :  -3.22             Mean   : 152.8           
##  3rd Qu.:  25.00             3rd Qu.:   0.0           
##  Max.   : 286.00             Max.   :2704.0           
##  NA's   :95009                                        
##  LoanFirstDefaultedCycleNumber LoanMonthsSinceOrigination   LoanNumber    
##  Min.   : 0.00                 Min.   :  0.0              Min.   :     1  
##  1st Qu.: 9.00                 1st Qu.:  6.0              1st Qu.: 37332  
##  Median :14.00                 Median : 21.0              Median : 68599  
##  Mean   :16.27                 Mean   : 31.9              Mean   : 69444  
##  3rd Qu.:22.00                 3rd Qu.: 65.0              3rd Qu.:101901  
##  Max.   :44.00                 Max.   :100.0              Max.   :136486  
##  NA's   :96985                                                            
##  LoanOriginalAmount          LoanOriginationDate LoanOriginationQuarter
##  Min.   : 1000      2014-01-22 00:00:00:   491   Q4 2013:14450         
##  1st Qu.: 4000      2013-11-13 00:00:00:   490   Q1 2014:12172         
##  Median : 6500      2014-02-19 00:00:00:   439   Q3 2013: 9180         
##  Mean   : 8337      2013-10-16 00:00:00:   434   Q2 2013: 7099         
##  3rd Qu.:12000      2014-01-28 00:00:00:   339   Q3 2012: 5632         
##  Max.   :35000      2013-09-24 00:00:00:   316   Q2 2012: 5061         
##                     (Other)            :111428   (Other):60343         
##                    MemberKey      MonthlyLoanPayment LP_CustomerPayments
##  63CA34120866140639431C9:     9   Min.   :   0.0     Min.   :   -2.35   
##  16083364744933457E57FB9:     8   1st Qu.: 131.6     1st Qu.: 1005.76   
##  3A2F3380477699707C81385:     8   Median : 217.7     Median : 2583.83   
##  4D9C3403302047712AD0CDD:     8   Mean   : 272.5     Mean   : 4183.08   
##  739C338135235294782AE75:     8   3rd Qu.: 371.6     3rd Qu.: 5548.40   
##  7E1733653050264822FAA3D:     8   Max.   :2251.5     Max.   :40702.39   
##  (Other)                :113888                                         
##  LP_CustomerPrincipalPayments LP_InterestandFees LP_ServiceFees   
##  Min.   :    0.0              Min.   :   -2.35   Min.   :-664.87  
##  1st Qu.:  500.9              1st Qu.:  274.87   1st Qu.: -73.18  
##  Median : 1587.5              Median :  700.84   Median : -34.44  
##  Mean   : 3105.5              Mean   : 1077.54   Mean   : -54.73  
##  3rd Qu.: 4000.0              3rd Qu.: 1458.54   3rd Qu.: -13.92  
##  Max.   :35000.0              Max.   :15617.03   Max.   :  32.06  
##                                                                   
##  LP_CollectionFees  LP_GrossPrincipalLoss LP_NetPrincipalLoss
##  Min.   :-9274.75   Min.   :  -94.2       Min.   : -954.5    
##  1st Qu.:    0.00   1st Qu.:    0.0       1st Qu.:    0.0    
##  Median :    0.00   Median :    0.0       Median :    0.0    
##  Mean   :  -14.24   Mean   :  700.4       Mean   :  681.4    
##  3rd Qu.:    0.00   3rd Qu.:    0.0       3rd Qu.:    0.0    
##  Max.   :    0.00   Max.   :25000.0       Max.   :25000.0    
##                                                              
##  LP_NonPrincipalRecoverypayments PercentFunded    Recommendations   
##  Min.   :    0.00                Min.   :0.7000   Min.   : 0.00000  
##  1st Qu.:    0.00                1st Qu.:1.0000   1st Qu.: 0.00000  
##  Median :    0.00                Median :1.0000   Median : 0.00000  
##  Mean   :   25.14                Mean   :0.9986   Mean   : 0.04803  
##  3rd Qu.:    0.00                3rd Qu.:1.0000   3rd Qu.: 0.00000  
##  Max.   :21117.90                Max.   :1.0125   Max.   :39.00000  
##                                                                     
##  InvestmentFromFriendsCount InvestmentFromFriendsAmount   Investors      
##  Min.   : 0.00000           Min.   :    0.00            Min.   :   1.00  
##  1st Qu.: 0.00000           1st Qu.:    0.00            1st Qu.:   2.00  
##  Median : 0.00000           Median :    0.00            Median :  44.00  
##  Mean   : 0.02346           Mean   :   16.55            Mean   :  80.48  
##  3rd Qu.: 0.00000           3rd Qu.:    0.00            3rd Qu.: 115.00  
##  Max.   :33.00000           Max.   :25000.00            Max.   :1189.00  
## 
## 'data.frame':    113937 obs. of  81 variables:
##  $ ListingKey                         : Factor w/ 113066 levels "00003546482094282EF90E5",..: 7180 7193 6647 6669 6686 6689 6699 6706 6687 6687 ...
##  $ ListingNumber                      : int  193129 1209647 81716 658116 909464 1074836 750899 768193 1023355 1023355 ...
##  $ ListingCreationDate                : Factor w/ 113064 levels "2005-11-09 20:44:28.847000000",..: 14184 111894 6429 64760 85967 100310 72556 74019 97834 97834 ...
##  $ CreditGrade                        : Factor w/ 9 levels "","A","AA","B",..: 5 1 8 1 1 1 1 1 1 1 ...
##  $ Term                               : int  36 36 36 36 36 60 36 36 36 36 ...
##  $ LoanStatus                         : Factor w/ 12 levels "Cancelled","Chargedoff",..: 3 4 3 4 4 4 4 4 4 4 ...
##  $ ClosedDate                         : Factor w/ 2803 levels "","2005-11-25 00:00:00",..: 1138 1 1263 1 1 1 1 1 1 1 ...
##  $ BorrowerAPR                        : num  0.165 0.12 0.283 0.125 0.246 ...
##  $ BorrowerRate                       : num  0.158 0.092 0.275 0.0974 0.2085 ...
##  $ LenderYield                        : num  0.138 0.082 0.24 0.0874 0.1985 ...
##  $ EstimatedEffectiveYield            : num  NA 0.0796 NA 0.0849 0.1832 ...
##  $ EstimatedLoss                      : num  NA 0.0249 NA 0.0249 0.0925 ...
##  $ EstimatedReturn                    : num  NA 0.0547 NA 0.06 0.0907 ...
##  $ ProsperRating..numeric.            : int  NA 6 NA 6 3 5 2 4 7 7 ...
##  $ ProsperRating..Alpha.              : Factor w/ 8 levels "","A","AA","B",..: 1 2 1 2 6 4 7 5 3 3 ...
##  $ ProsperScore                       : num  NA 7 NA 9 4 10 2 4 9 11 ...
##  $ ListingCategory..numeric.          : int  0 2 0 16 2 1 1 2 7 7 ...
##  $ BorrowerState                      : Factor w/ 52 levels "","AK","AL","AR",..: 7 7 12 12 25 34 18 6 16 16 ...
##  $ Occupation                         : Factor w/ 68 levels "","Accountant/CPA",..: 37 43 37 52 21 43 50 29 24 24 ...
##  $ EmploymentStatus                   : Factor w/ 9 levels "","Employed",..: 9 2 4 2 2 2 2 2 2 2 ...
##  $ EmploymentStatusDuration           : int  2 44 NA 113 44 82 172 103 269 269 ...
##  $ IsBorrowerHomeowner                : Factor w/ 2 levels "False","True": 2 1 1 2 2 2 1 1 2 2 ...
##  $ CurrentlyInGroup                   : Factor w/ 2 levels "False","True": 2 1 2 1 1 1 1 1 1 1 ...
##  $ GroupKey                           : Factor w/ 707 levels "","00343376901312423168731",..: 1 1 335 1 1 1 1 1 1 1 ...
##  $ DateCreditPulled                   : Factor w/ 112992 levels "2005-11-09 00:30:04.487000000",..: 14347 111883 6446 64724 85857 100382 72500 73937 97888 97888 ...
##  $ CreditScoreRangeLower              : int  640 680 480 800 680 740 680 700 820 820 ...
##  $ CreditScoreRangeUpper              : int  659 699 499 819 699 759 699 719 839 839 ...
##  $ FirstRecordedCreditLine            : Factor w/ 11586 levels "","1947-08-24 00:00:00",..: 8639 6617 8927 2247 9498 497 8265 7685 5543 5543 ...
##  $ CurrentCreditLines                 : int  5 14 NA 5 19 21 10 6 17 17 ...
##  $ OpenCreditLines                    : int  4 14 NA 5 19 17 7 6 16 16 ...
##  $ TotalCreditLinespast7years         : int  12 29 3 29 49 49 20 10 32 32 ...
##  $ OpenRevolvingAccounts              : int  1 13 0 7 6 13 6 5 12 12 ...
##  $ OpenRevolvingMonthlyPayment        : num  24 389 0 115 220 1410 214 101 219 219 ...
##  $ InquiriesLast6Months               : int  3 3 0 0 1 0 0 3 1 1 ...
##  $ TotalInquiries                     : num  3 5 1 1 9 2 0 16 6 6 ...
##  $ CurrentDelinquencies               : int  2 0 1 4 0 0 0 0 0 0 ...
##  $ AmountDelinquent                   : num  472 0 NA 10056 0 ...
##  $ DelinquenciesLast7Years            : int  4 0 0 14 0 0 0 0 0 0 ...
##  $ PublicRecordsLast10Years           : int  0 1 0 0 0 0 0 1 0 0 ...
##  $ PublicRecordsLast12Months          : int  0 0 NA 0 0 0 0 0 0 0 ...
##  $ RevolvingCreditBalance             : num  0 3989 NA 1444 6193 ...
##  $ BankcardUtilization                : num  0 0.21 NA 0.04 0.81 0.39 0.72 0.13 0.11 0.11 ...
##  $ AvailableBankcardCredit            : num  1500 10266 NA 30754 695 ...
##  $ TotalTrades                        : num  11 29 NA 26 39 47 16 10 29 29 ...
##  $ TradesNeverDelinquent..percentage. : num  0.81 1 NA 0.76 0.95 1 0.68 0.8 1 1 ...
##  $ TradesOpenedLast6Months            : num  0 2 NA 0 2 0 0 0 1 1 ...
##  $ DebtToIncomeRatio                  : num  0.17 0.18 0.06 0.15 0.26 0.36 0.27 0.24 0.25 0.25 ...
##  $ IncomeRange                        : Factor w/ 8 levels "$0","$1-24,999",..: 4 5 7 4 3 3 4 4 4 4 ...
##  $ IncomeVerifiable                   : Factor w/ 2 levels "False","True": 2 2 2 2 2 2 2 2 2 2 ...
##  $ StatedMonthlyIncome                : num  3083 6125 2083 2875 9583 ...
##  $ LoanKey                            : Factor w/ 113066 levels "00003683605746079487FF7",..: 100337 69837 46303 70776 71387 86505 91250 5425 908 908 ...
##  $ TotalProsperLoans                  : int  NA NA NA NA 1 NA NA NA NA NA ...
##  $ TotalProsperPaymentsBilled         : int  NA NA NA NA 11 NA NA NA NA NA ...
##  $ OnTimeProsperPayments              : int  NA NA NA NA 11 NA NA NA NA NA ...
##  $ ProsperPaymentsLessThanOneMonthLate: int  NA NA NA NA 0 NA NA NA NA NA ...
##  $ ProsperPaymentsOneMonthPlusLate    : int  NA NA NA NA 0 NA NA NA NA NA ...
##  $ ProsperPrincipalBorrowed           : num  NA NA NA NA 11000 NA NA NA NA NA ...
##  $ ProsperPrincipalOutstanding        : num  NA NA NA NA 9948 ...
##  $ ScorexChangeAtTimeOfListing        : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ LoanCurrentDaysDelinquent          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ LoanFirstDefaultedCycleNumber      : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ LoanMonthsSinceOrigination         : int  78 0 86 16 6 3 11 10 3 3 ...
##  $ LoanNumber                         : int  19141 134815 6466 77296 102670 123257 88353 90051 121268 121268 ...
##  $ LoanOriginalAmount                 : int  9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...
##  $ LoanOriginationDate                : Factor w/ 1873 levels "2005-11-15 00:00:00",..: 426 1866 260 1535 1757 1821 1649 1666 1813 1813 ...
##  $ LoanOriginationQuarter             : Factor w/ 33 levels "Q1 2006","Q1 2007",..: 18 8 2 32 24 33 16 16 33 33 ...
##  $ MemberKey                          : Factor w/ 90831 levels "00003397697413387CAF966",..: 11071 10302 33781 54939 19465 48037 60448 40951 26129 26129 ...
##  $ MonthlyLoanPayment                 : num  330 319 123 321 564 ...
##  $ LP_CustomerPayments                : num  11396 0 4187 5143 2820 ...
##  $ LP_CustomerPrincipalPayments       : num  9425 0 3001 4091 1563 ...
##  $ LP_InterestandFees                 : num  1971 0 1186 1052 1257 ...
##  $ LP_ServiceFees                     : num  -133.2 0 -24.2 -108 -60.3 ...
##  $ LP_CollectionFees                  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ LP_GrossPrincipalLoss              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ LP_NetPrincipalLoss                : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ LP_NonPrincipalRecoverypayments    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ PercentFunded                      : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Recommendations                    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ InvestmentFromFriendsCount         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ InvestmentFromFriendsAmount        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Investors                          : int  258 1 41 158 20 1 1 1 1 1 ...

Introduction

This data exploration covers a large dataset consisting of 113,937 loans from Prosper Marketplace. The main goal of this exercise is to analyze which factors affect the Prosper Score and how. The Prosper Score ranges from 1 to 11 and indicates the risk of a Prosper borrower listing (1 is the highest risk, or worst, score). I will start by looking at the distribution of each variable that I am interested in. I will then explore the relationships between different variables and how they correlate with the Prosper Score. Finally, a model based on the findings will be created to predict the Prosper Score.

Univariate Plots Section

Prosper Score

plot of chunk ProsperScore_hist

Most Prosper Scores seem to fall between 4 and 8, and the most common score is 4.

plot of chunk CreditGrade_hist

CreditGrade has many missing (empty) values: 84,984 of the 113,937 loans have no grade.

plot of chunk CreditGrade_hist2

After removing the missing category, the plot shows that the number of loans increases as CreditGrade decreases until “C”, and then drops as CreditGrade decreases further. There are very few loans with a CreditGrade of “NC”. Why are loans concentrated at the medium grade “C” rather than at the highest grades?
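A minimal sketch of how the empty category can be dropped and the grades put in rank order before plotting. The vector `credit_grade` is made-up sample data, and the ordering AA, A, B, C, D, E, HR, NC (best to worst) is my assumption about the grade scale; it is not stated in this report.

```r
# Hypothetical sample of CreditGrade values; "" marks a missing grade.
credit_grade <- c("C", "", "HR", "", "AA", "C", "D", "B", "", "C")

# Assumed rank order of grades, best to worst.
grade_order <- c("AA", "A", "B", "C", "D", "E", "HR", "NC")

# Drop the empty entries, then build an ordered factor so that plots and
# tables follow the grade ranking instead of alphabetical order.
graded <- credit_grade[credit_grade != ""]
graded <- factor(graded, levels = grade_order, ordered = TRUE)

# table() on a factor lists every level, including those with zero loans.
grade_counts <- table(graded)
print(grade_counts)
```

Without the explicit `levels`, R would sort the grades alphabetically (A, AA, B, ...), which hides the rise and fall around “C”.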

Loan Amount

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1000    4000    6500    8337   12000   35000

The median original loan amount is 6500 and the mean is 8337.

plot of chunk LoanOriginalAmount_hist

The majority of the loans are below 10000. The distribution also seems to be multi-modal, with peaks at 4500, 10000, 15000, 20000 and 25000. Let's transform the plot to make it easier to read.

plot of chunk loan_original_amount_hist

The plot shows that the number of loans decreases as LoanOriginalAmount increases. There are also quite a few spikes above the general trend, which indicates that the distribution is quite sparse, with most loan amounts at multiples of 5000. There are very few loans with an original amount over 25000.
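The spikes at round amounts can be quantified with the modulo operator. The sketch below uses only the first ten LoanOriginalAmount values printed by `str()` at the top of this report, so the resulting share illustrates the technique and is not an estimate for the full dataset.

```r
# First ten LoanOriginalAmount values shown in the str() output above.
loan_amount <- c(9425, 10000, 3001, 10000, 15000, 15000, 3000, 10000, 10000, 10000)

# A loan sits on a "round" amount when it divides evenly by 5000.
is_multiple_of_5000 <- loan_amount %% 5000 == 0

# The mean of a logical vector is the share of TRUE values.
share_round <- mean(is_multiple_of_5000)
print(share_round)
```

Running the same two lines on the full column would give the actual share of loans that sit exactly on a multiple of 5000.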

Income

plot of chunk income_range_hist

Most of the borrowers have an annual income in the $25K-$75K range.

plot of chunk DebtToIncomeRatio_hist

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.140   0.220   0.276   0.320  10.010    8554

DebtToIncomeRatio shows a near-normal distribution on a log scale, peaking at 0.2.
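A sketch of how a log-scale histogram can be built while handling the zeros and NA values that the summary reports for DebtToIncomeRatio. The data here are synthetic (drawn from a log-normal distribution with parameters I picked for illustration), so the peak it finds is not a result from the Prosper data.

```r
set.seed(1)  # reproducible synthetic data

# Synthetic stand-in for DebtToIncomeRatio: log-normal values plus a few
# zeros and NAs, mimicking the Min. of 0 and the NA's in the summary.
ratio <- c(rlnorm(500, meanlog = log(0.22), sdlog = 0.6), 0, 0, NA, NA)

# log10(0) is -Inf and NA stays NA, so keep only positive, non-missing values.
positive <- ratio[!is.na(ratio) & ratio > 0]
log_ratio <- log10(positive)

# Bin on the log scale; plot = FALSE returns the bin counts without drawing.
h <- hist(log_ratio, breaks = 30, plot = FALSE)

# Midpoint of the fullest bin, converted back to the original scale.
peak <- 10^h$mids[which.max(h$counts)]
print(peak)
```

Filtering before taking the logarithm matters: leaving the zeros in would produce `-Inf` values that `hist()` cannot bin.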

plot of chunk StatedMonthlyIncome_hist

StatedMonthlyIncome is approximately normally distributed on a log scale.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0    3200    4667    5608    6825 1750000

The median StatedMonthlyIncome is 4667 and the mean is 5608.
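One likely reason the mean sits above the median is the presence of a few very large incomes (the full summary at the top reports a maximum of 1750003). The sketch below shows the effect on a tiny sample: the first five StatedMonthlyIncome values from the `str()` output, with and without that maximum added. It is an illustration of how the two statistics react, not a reconstruction of the full-data figures.

```r
# First five StatedMonthlyIncome values from the str() output, as printed.
income <- c(3083, 6125, 2083, 2875, 9583)

# Add the maximum reported in the summary to see how one extreme value acts.
income_with_max <- c(income, 1750003)

print(c(median = median(income), mean = mean(income)))
print(c(median = median(income_with_max), mean = mean(income_with_max)))
```

The median barely moves when the extreme value is added, while the mean grows by orders of magnitude, which is why the median is the safer summary for this variable.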

Occupation and Employment

plot of chunk occupation_hist

The largest single category is “Other” (28617 loans), and the remaining loans are spread across many different occupations, so the distribution of occupation is long-tailed.
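To put a number on how dominant “Other” is, the category shares can be computed directly from the Occupation counts in the `summary()` output at the top of this report. Note that “(Other)” in that output is the lump of all remaining levels created by `summary()`, not an occupation in its own right.

```r
# Occupation counts copied from the summary() output at the top of this report.
# "(Other)" is summary()'s lump of all remaining levels, not a real occupation.
occupation <- c(
  "Other" = 28617, "Professional" = 13628, "Computer Programmer" = 4478,
  "Executive" = 4311, "Teacher" = 3759, "Administrative Assistant" = 3688,
  "(Other)" = 55456
)

# Share of all loans in each category, in percent.
share <- round(100 * occupation / sum(occupation), 1)
print(share)
```

The counts add up to the 113937 rows of the dataset, and “Other” accounts for roughly a quarter of them.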

plot of chunk employment_status_hist

The majority of the loans are from “Employed” and “Full-time” borrowers.

Interest Rate and Return

plot of chunk BorrowerAPR_hist

BorrowerAPR has a roughly bell-shaped distribution, peaking at around 0.18. There are also 3 other spikes at 0.09, 0.29 and 0.37. Let's take a closer look.

plot of chunk BorrowerAPR_hist2

A histogram with finer bins shows a similar trend as above, but with a prominent peak at 0.37. I wonder why there are so many loans with a BorrowerAPR of 37%? The distribution also seems to be multi-modal, with a notable peak at the low end, around 0.09.
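The effect of bin width on what a histogram reveals can be sketched as follows. The data are synthetic (a broad bump plus a narrow cluster near 0.37, both invented for illustration), and the sketch uses base R `hist()` to stay dependency-free, which may differ from the plotting library used for the figures in this report.

```r
set.seed(2)  # reproducible synthetic data

# Synthetic stand-in for BorrowerAPR: a broad bump plus a narrow cluster,
# both hypothetical, restricted to values strictly between 0 and 0.5.
apr <- c(rnorm(900, mean = 0.20, sd = 0.07), rnorm(100, mean = 0.37, sd = 0.002))
apr <- apr[apr > 0 & apr < 0.5]

# Same data, two bin widths: 0.05 (coarse) and 0.01 (fine).
coarse <- hist(apr, breaks = seq(0, 0.55, by = 0.05), plot = FALSE)
fine   <- hist(apr, breaks = seq(0, 0.52, by = 0.01), plot = FALSE)

print(length(coarse$counts))
print(length(fine$counts))
```

With wide bins a narrow cluster is averaged together with its neighbours; with narrow bins it can stand out as its own spike, which is the kind of difference described between the two BorrowerAPR plots.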

plot of chunk LenderYield_hist2

plot of chunk estimated_effective_yield_hist

plot of chunk estimated_effective_yield_hist2

plot of chunk estimated_loss_hist

plot of chunk estimated_loss_hist2

plot of chunk estimated_return_hist

EstimatedReturn appears to be roughly normally distributed, peaking at 0.1, with some values below 0.

plot of chunk estimated_return_hist2

The transformed EstimatedReturn seems to be multi-modal, with peaks at 0.08, 0.11, 0.12 and 0.14.

Credit History

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

plot of chunk total_prosper_loans_hist

The number of loans appears to decrease exponentially as TotalProsperLoans increases.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    1.00    1.00    1.42    2.00    8.00   91852
## [1] "% NAs:  0.806164810377665 "

The summary shows 91852 NA values, about 81% of all observations. This suggests that the majority of the borrowers are first-time borrowers on Prosper.
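The NA share can be computed as the mean of `is.na()`. The sketch below applies this to the first ten TotalProsperLoans values from the `str()` output and then repeats the calculation with the full-data counts from the summary (91852 NA values out of 113937 rows).

```r
# TotalProsperLoans for the first ten rows, as shown in the str() output.
total_prosper_loans <- c(NA, NA, NA, NA, 1, NA, NA, NA, NA, NA)

# is.na() gives TRUE/FALSE; the mean of a logical vector is the share of TRUEs.
na_share_sample <- mean(is.na(total_prosper_loans))
print(na_share_sample)

# The same ratio from the full-data counts reported in the summary.
na_share_full <- 91852 / 113937
print(na_share_full)
```

Both results are fractions between 0 and 1; multiply by 100 to express them as percentages.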

plot of chunk total_prosper_loans_hist2

The plot above confirms the negative exponential relationship.
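One way to check an exponential decrease numerically, alongside the plot, is to fit a straight line to the logarithm of the counts. The counts below are invented for illustration and are not taken from the dataset; the approach itself is only a sketch of one possible check.

```r
# Hypothetical loan counts for TotalProsperLoans = 1..8; these numbers are
# invented for illustration and are not taken from the dataset.
n_loans <- 1:8
count <- c(9000, 3300, 1200, 450, 160, 60, 22, 8)

# If count ~ a * exp(b * n_loans), then log(count) is linear in n_loans.
fit <- lm(log(count) ~ n_loans)

# A negative slope with R-squared close to 1 supports exponential decay.
decay_rate <- unname(coef(fit)["n_loans"])
r_squared <- summary(fit)$r.squared
print(c(decay_rate = decay_rate, r_squared = r_squared))
```

The same idea underlies plotting counts on a log-scaled y axis: exponential decay shows up as a straight, downward-sloping line.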

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

plot of chunk TotalProsperPaymentsBilled_hist

plot of chunk TotalProsperPaymentsBilled_hist2

The plot above shows an unusual pattern in TotalProsperPaymentsBilled. The distribution seems to consist of several exponentially decreasing segments starting at 1, 7, 11 and 35, with a prominent spike at 35.

##    [1]  11  67  12   8   9  20   9   1   4  12  16   6   6   9  10  32  35
##   [18]  14  11   6  14  29   5  22   4  12  13  24   3  10   9   1  25   4
##   [35]  15  46  14   2   6  13  30  35   2  12  27   9   9  18  18  28  12
##   [52]  16  26   3  34  74  13   9  35  12  35   9  16  10   2  35   6  35
##   [69]   2  36  12  34  54  35  63   7  50   7  32  33  34  21  19   8  38
##   [86]  15  27   9  13  42   9  39  35  34  65  39  11  23   8  95   9  15
##  [103]  56   7  63  53  50   7   8   9  66  25   6   2  20  51  11   7  11
##  [120]  27   9  22  29  32   1   8  33  29  12   7  35   4   6  76   7   6
##  [137]  46  35  12  12  20   9  13  25  16  67  17  13  15  20   7   1  11
##  [154]  14  20   6  11  11  14   7  29   3  10  51  48  25   2  24  18  34
##  [171]  67   8  38  23  22  11  78  19  25  11  11   9  26  11   6  10  17
##  [188]  30  50  32  33  11  28  35  11  24  22  14  22   4   6  12  13  11
##  [205]  13  10  53 107  17   9  13  35  26  46  24  10  22   1   7  17  35
##  [222]  34  12  35  26   2  13  15   6  19   8  77 131  35  58  20  18  36
##  [239]  10   6   2   9   6   6  19   9  36  14  39  30  14   6   4  23   7
##  [256]  18   7  42  21  44   6  20  13  29   3  27 111  61  15  10  61  13
##  [273]   5  31  36  50  77  23  35  55  11  11   7  28   4  15  59  30  26
##  [290]  28   9  10  35  14   4  11  14   8  26  52   7  13 110   7  36  19
##  [307]   6   4  10  26  32   8  62  47  11  58   3   6  14  13  67  11  35
##  [324]  16  16  11  39  17  37  11  17  17  42  10  78  69  47   9  28  34
##  [341]   5   9  20   9   6  12  27   1  38  31  36   8   1   8  10  16  13
##  [358]   9  73   8  19   4  32  11  15  17  32  30   8  57  12  23   6  21
##  [375]  35   6   3  64  14  10  89  26  66   6  10   8  18  65  15  26  23
##  [392]  34   9  48   6  34  27  54  26  11  23  21  57  14  12  16   9  61
##  [409]  15  72   8  18  11   9   5   6  52  28   7  29  35  21  64  44  34
##  [426]  24  16   9  34  51   9  14   2  47  35  11  21  61  35  61   7  24
##  [443] 101  83  53  57  29  47  19  24  62   1  35  12  10   7   9  14   6
##  [460] 103  12  10  22  35  11  24   6  21  14   9  11  13  16  36   7   8
##  [477]  10  14   8  10  70   6   6  11   6  17  27   1  22  28  24  34  31
##  [494]  14   6   7  32  35  28  14   9   9  46  35  16  14   5  33   9  34
##  [511]   6   9   9   9   2  16  47  14  35 101   9  52  20  19  16  10   2
##  [528]  20  40  10  21  15  18  35  21   6  15  35  19   6  12   6  67  27
##  [545]   9  57  11  25   6  24   9   9  11  27  36  32  35 102   9  33   2
##  [562]  12  18  15  44   9  14  33  50  35  20  54  10  23  13  11  24  54
##  [579]  14  19  14  24  44  44  42  19  10  22  73   7  10  15  22  18  51
##  [596]   1   6   6  14  51  21  35  21  38   8  25  49  11  18  11  48   6
##  [613]   9  17   2   1  17  42  16  10  24  28  13  18  28   9  60  11  17
##  [630]  16  41  10  38  25  13   1   9  24   4   9  14 102   9  52  11  14
##  [647]  12  19   1  45  26  76  12  10   8  18  10  35  20  15  22  55  18
##  [664]  17  10  43   9  45  12  44  20  47  15  15  32  46  13  18  12   1
##  [681]  24  40  70   9  75  31  51  12  24  26   0  30   2  18  31  21   9
##  [698]  16   6  35  15  10   6  70  22  68  49   1  43  46  14  40   3  10
##  [715]  25   7  24  13  16  20  10  55  32  53   4  22  13  49  26  35  15
##  [732]  11   1  20  13  30  17  28  48  28   5   1  83  36   6   1  41   8
##  [749]  35  21  29  53   9  56  81  43   1  56  68  10 118  11  32   4  47
##  [766]   6  46  10   6  47  35   9   7  53  26  49  13  33  10  16  12  14
##  [783]   4  39  15  29   9   9  62   8  25  11  32  13  10  35  11  38  18
##  [800]  33   8  44  35   1  31   4  10  19   5  29  35  16  43  33  63  42
##  [817]  10  13   7  12   9  71   8   9   3  11   9   7  17  21  25   9  17
##  [834]   9  61   9  19  46  12  68  16   7  14  10  23  30  44  33  46  33
##  [851]   9  15   6  21  20  10   9  16  30  27   8  14  36   6  17  21  30
##  [868]  11   6   9  34  21  12  48  18  20  15  47  15  21  11  10  49  11
##  [885]  64  20  10  20  55  11   9  26  35  36  24  35  13  48  49  20   6
##  [902]   9  31  10  57  90  22   1  10  14  33  12  51   9  25   3  12   7
##  [919]  20  24  25   7  14  18   9  13  17  10  88  35  35  10  14  30  14
##  [936]   1   9   6  13  22  30   4   2   9  21  17  15   8  33  35  15   7
##  [953]  53  15  32   7  24  25  23   9  23  53   5   9   9  16  16  13  14
##  [970]  22  35  14  35  19  10  45   6  12   2  34  44   7  32   5  35  13
##  [987]  30   7   8   9   9  15  35   8   8  33   6  44   4   6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    9.00   16.00   22.93   33.00  141.00   91852

The summary shows there are 91852 NA values, about 80% of total observations. This means the majority of

plot of chunk CreditLines_hist

plot of chunk Delinquencies_hist

The histograms show that CurrentDelinquencies and DelinquenciesLast7Years follow exponentially decreasing distributions, while AmountDelinquent is bell-shaped on a log scale.

Date

plot of chunk LoanDate_hist

The time series plot shows that the number of loans increased from 2006 and dropped to 0 at the end of 2008 (due to the financial crisis). It started to recover at the end of 2009 and jumped significantly at the beginning of 2013.

plot of chunk loan_status_hist

## 
##              Cancelled             Chargedoff              Completed 
##                      5                  11992                  38074 
##                Current              Defaulted FinalPaymentInProgress 
##                  56576                   5018                    205 
##   Past Due (>120 days)   Past Due (1-15 days)  Past Due (16-30 days) 
##                     16                    806                    265 
##  Past Due (31-60 days)  Past Due (61-90 days) Past Due (91-120 days) 
##                    363                    313                    304
## 
##              Cancelled             Chargedoff              Completed 
##           0.0000438839           0.1052511476           0.3341671274 
##                Current              Defaulted FinalPaymentInProgress 
##           0.4965551138           0.0440418828           0.0017992399 
##   Past Due (>120 days)   Past Due (1-15 days)  Past Due (16-30 days) 
##           0.0001404285           0.0070740848           0.0023258467 
##  Past Due (31-60 days)  Past Due (61-90 days) Past Due (91-120 days) 
##           0.0031859712           0.0027471322           0.0026681412

About half of the loans (49.7%) are Current and 33.4% are Completed. Defaulted loans account for 4.4% and charged-off loans for 10.5%.
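
The counts and proportions above can be reproduced with base R's `table()` and `prop.table()`. The sketch below uses a small made-up `status` vector as a stand-in for the LoanStatus column; the values are illustrative only, not taken from the dataset.

```r
# Toy stand-in for the LoanStatus column (illustrative values only)
status <- factor(c("Current", "Current", "Completed", "Chargedoff",
                   "Current", "Completed", "Defaulted", "Current"))

counts <- table(status)        # absolute counts per status
shares <- prop.table(counts)   # counts divided by the total

# Shares as percentages, rounded to one decimal place
round(100 * shares, 1)
```

`prop.table()` only rescales the table it is given, so the shares always sum to 1 regardless of how many status levels are present.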

Univariate Analysis

What is the structure of your dataset?

The dataset contains 113,937 observations of 81 variables.

Interesting categorical variables:

What is/are the main feature(s) of interest in your dataset?

The main features of interest are ProsperScore, EstimatedReturn, BorrowerAPR, and LoanOriginalAmount, because I'd like to create a predictive model of ProsperScore that informs lenders' loan decisions. I believe EstimatedReturn, BorrowerAPR, and LoanOriginalAmount can be important indicators of the ProsperScore.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

I think three categories of variables will help build the model I would like to create.

Did you create any new variables from existing variables in the dataset?

I created 4 new variables that convert the date-related factor variables (ListingCreationDate, ClosedDate, DateCreditPulled, FirstRecordedCreditLine) into Date type. I created them because Date-typed variables were needed to draw the time series plot of LoanOriginationDate.
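
A minimal sketch of such a conversion, assuming the timestamps are stored as factors in the format seen in the summary output (for example `2013-10-02 17:20:16.550000000`). The helper `to_date` and the `.d` column suffix are hypothetical names chosen for illustration, not necessarily what the original analysis used.

```r
# Toy stand-in for two of the date columns, stored as factors
df <- data.frame(
  ListingCreationDate = factor(c("2013-10-02 17:20:16.550000000",
                                 "2013-08-28 20:31:41.107000000")),
  ClosedDate = factor(c("", "2014-03-04 00:00:00"))
)

# Hypothetical helper: factor -> character -> Date.
# With an explicit format, the trailing time-of-day text is ignored
# and empty strings become NA.
to_date <- function(x) as.Date(as.character(x), format = "%Y-%m-%d")

df$ListingCreationDate.d <- to_date(df$ListingCreationDate)
df$ClosedDate.d <- to_date(df$ClosedDate)

str(df[, c("ListingCreationDate.d", "ClosedDate.d")])
```

Going through `as.character()` first matters: calling a date parser directly on a factor risks working with the internal integer codes instead of the labels.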

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

In the histogram of TotalProsperLoans, I adjusted the y-axis to a log scale, which presents the negative exponential relationship better. The x-axis of the StatedMonthlyIncome histogram was adjusted to a log scale to show that it is approximately normal on that scale.

I also adjusted the bin width for several histograms, including LoanOriginalAmount, EstimatedReturn, EstimatedLoss, and BorrowerAPR. The default bin width from qplot is too large and hides some of the finer patterns, such as the spikes at multiples of 5000 in LoanOriginalAmount.
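
The effect of bin width can be sketched without plotting by comparing the bin counts directly. The example below uses base R's `hist(..., plot = FALSE)` on synthetic amounts (an assumption: many loans at round multiples of 5000 plus arbitrary amounts in between); the original plots were drawn with qplot, where the corresponding setting is the `binwidth` argument.

```r
# Synthetic loan amounts (assumption): spikes at multiples of 5000
# plus uniformly spread amounts in between
set.seed(1)
amount <- c(rep(c(5000, 10000, 15000), times = c(300, 200, 100)),
            round(runif(400, min = 1000, max = 20000)))

# Same data, coarse bins (5000 wide) versus fine bins (500 wide)
coarse <- hist(amount, breaks = seq(0, 20000, by = 5000), plot = FALSE)
fine   <- hist(amount, breaks = seq(0, 20000, by = 500),  plot = FALSE)

# How much the tallest bin stands out from a typical bin
peak_ratio <- function(h) max(h$counts) / median(h$counts)
peak_ratio(coarse)
peak_ratio(fine)
```

With coarse bins each spike is averaged together with its neighbours, so the tallest bin barely stands out; with fine bins the spikes dominate the typical bin count.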

Bivariate Plots Section

Correlation

##                             ProsperScore   BorrowerAPR   LenderYield
## ProsperScore                 1.000000000 -0.7756081832 -0.7568693207
## BorrowerAPR                 -0.775608183  1.0000000000  0.9933354234
## LenderYield                 -0.756869321  0.9933354234  1.0000000000
## EstimatedEffectiveYield     -0.722746101  0.7975218152  0.7925556235
## EstimatedLoss               -0.717750956  0.9153079884  0.9129083651
## EstimatedReturn             -0.526230487  0.7408701771  0.7594150419
## DebtToIncomeRatio           -0.188196543  0.1752669698  0.1711300152
## StatedMonthlyIncome          0.176385631 -0.1981553243 -0.1946869337
## TotalProsperLoans            0.068466164 -0.0494553437 -0.0491863786
## TotalProsperPaymentsBilled   0.045597505  0.0261556567  0.0251046567
## CreditScoreRangeLower        0.483291107 -0.7031465276 -0.6858133948
## CreditScoreRangeUpper        0.483291107 -0.7031465276 -0.6858133948
## CurrentCreditLines          -0.051972118  0.0236402185  0.0245897893
## OpenCreditLines             -0.027991537  0.0039527524  0.0054184901
## TotalCreditLinespast7years  -0.130740839  0.1013894315  0.1010506039
## OpenRevolvingAccounts        0.036237188 -0.0425180489 -0.0412906803
## OpenRevolvingMonthlyPayment -0.001285185  0.0005081251 -0.0005642008
## CurrentDelinquencies        -0.184801155  0.1999374629  0.1951766603
## AmountDelinquent            -0.073729852  0.0661589858  0.0621256703
## DelinquenciesLast7Years     -0.141045290  0.1709799691  0.1665368389
##                             EstimatedEffectiveYield EstimatedLoss
## ProsperScore                           -0.722746101  -0.717750956
## BorrowerAPR                             0.797521815   0.915307988
## LenderYield                             0.792555623   0.912908365
## EstimatedEffectiveYield                 1.000000000   0.579764141
## EstimatedLoss                           0.579764141   1.000000000
## EstimatedReturn                         0.811548399   0.432283571
## DebtToIncomeRatio                       0.185199666   0.151735903
## StatedMonthlyIncome                    -0.144907090  -0.184053781
## TotalProsperLoans                      -0.007421791  -0.035702291
## TotalProsperPaymentsBilled              0.011213550   0.045366434
## CreditScoreRangeLower                  -0.587233268  -0.649480807
## CreditScoreRangeUpper                  -0.587233268  -0.649480807
## CurrentCreditLines                      0.055681024   0.017747170
## OpenCreditLines                         0.043942319   0.000775706
## TotalCreditLinespast7years              0.106500038   0.095720279
## OpenRevolvingAccounts                  -0.003185943  -0.047563091
## OpenRevolvingMonthlyPayment             0.043379178  -0.019428240
## CurrentDelinquencies                    0.150200026   0.210329006
## AmountDelinquent                        0.065092198   0.061061195
## DelinquenciesLast7Years                 0.137864873   0.166264744
##                             EstimatedReturn DebtToIncomeRatio
## ProsperScore                  -0.5262304875      -0.188196543
## BorrowerAPR                    0.7408701771       0.175266970
## LenderYield                    0.7594150419       0.171130015
## EstimatedEffectiveYield        0.8115483985       0.185199666
## EstimatedLoss                  0.4322835715       0.151735903
## EstimatedReturn                1.0000000000       0.130799420
## DebtToIncomeRatio              0.1307994197       1.000000000
## StatedMonthlyIncome           -0.1428056377      -0.200341005
## TotalProsperLoans             -0.0614952567       0.008303509
## TotalProsperPaymentsBilled    -0.0158679317       0.014252417
## CreditScoreRangeLower         -0.4775955525      -0.068101400
## CreditScoreRangeUpper         -0.4775955525      -0.068101400
## CurrentCreditLines             0.0173980251       0.141700358
## OpenCreditLines               -0.0004439253       0.147738368
## TotalCreditLinespast7years     0.0678500227       0.073401764
## OpenRevolvingAccounts         -0.0230184363       0.111563761
## OpenRevolvingMonthlyPayment    0.0228440506       0.110171436
## CurrentDelinquencies           0.0980637280      -0.033791124
## AmountDelinquent               0.0384208893      -0.033351034
## DelinquenciesLast7Years        0.1009121335      -0.066321404
##                             StatedMonthlyIncome TotalProsperLoans
## ProsperScore                        0.176385631       0.068466164
## BorrowerAPR                        -0.198155324      -0.049455344
## LenderYield                        -0.194686934      -0.049186379
## EstimatedEffectiveYield            -0.144907090      -0.007421791
## EstimatedLoss                      -0.184053781      -0.035702291
## EstimatedReturn                    -0.142805638      -0.061495257
## DebtToIncomeRatio                  -0.200341005       0.008303509
## StatedMonthlyIncome                 1.000000000       0.022127365
## TotalProsperLoans                   0.022127365       1.000000000
## TotalProsperPaymentsBilled         -0.008263897       0.709828656
## CreditScoreRangeLower               0.149756673      -0.007428554
## CreditScoreRangeUpper               0.149756673      -0.007428554
## CurrentCreditLines                  0.261973105       0.088060143
## OpenCreditLines                     0.254406275       0.070520470
## TotalCreditLinespast7years          0.234230348       0.132494357
## OpenRevolvingAccounts               0.174564346       0.036855177
## OpenRevolvingMonthlyPayment         0.367056149       0.009217599
## CurrentDelinquencies               -0.025870825       0.000478030
## AmountDelinquent                    0.026108752       0.001815828
## DelinquenciesLast7Years            -0.031009072      -0.027535815
##                             TotalProsperPaymentsBilled
## ProsperScore                               0.045597505
## BorrowerAPR                                0.026155657
## LenderYield                                0.025104657
## EstimatedEffectiveYield                    0.011213550
## EstimatedLoss                              0.045366434
## EstimatedReturn                           -0.015867932
## DebtToIncomeRatio                          0.014252417
## StatedMonthlyIncome                       -0.008263897
## TotalProsperLoans                          0.709828656
## TotalProsperPaymentsBilled                 1.000000000
## CreditScoreRangeLower                     -0.071590149
## CreditScoreRangeUpper                     -0.071590149
## CurrentCreditLines                         0.062899685
## OpenCreditLines                            0.052725632
## TotalCreditLinespast7years                 0.065882171
## OpenRevolvingAccounts                      0.016109632
## OpenRevolvingMonthlyPayment                0.012702300
## CurrentDelinquencies                       0.051584683
## AmountDelinquent                           0.025252482
## DelinquenciesLast7Years                    0.022932986
##                             CreditScoreRangeLower CreditScoreRangeUpper
## ProsperScore                          0.483291107           0.483291107
## BorrowerAPR                          -0.703146528          -0.703146528
## LenderYield                          -0.685813395          -0.685813395
## EstimatedEffectiveYield              -0.587233268          -0.587233268
## EstimatedLoss                        -0.649480807          -0.649480807
## EstimatedReturn                      -0.477595552          -0.477595552
## DebtToIncomeRatio                    -0.068101400          -0.068101400
## StatedMonthlyIncome                   0.149756673           0.149756673
## TotalProsperLoans                    -0.007428554          -0.007428554
## TotalProsperPaymentsBilled           -0.071590149          -0.071590149
## CreditScoreRangeLower                 1.000000000           1.000000000
## CreditScoreRangeUpper                 1.000000000           1.000000000
## CurrentCreditLines                    0.041655502           0.041655502
## OpenCreditLines                       0.047218968           0.047218968
## TotalCreditLinespast7years           -0.012833413          -0.012833413
## OpenRevolvingAccounts                 0.064060292           0.064060292
## OpenRevolvingMonthlyPayment           0.013485898           0.013485898
## CurrentDelinquencies                 -0.206635053          -0.206635053
## AmountDelinquent                     -0.065088824          -0.065088824
## DelinquenciesLast7Years              -0.205795835          -0.205795835
##                             CurrentCreditLines OpenCreditLines
## ProsperScore                       -0.05197212   -0.0279915371
## BorrowerAPR                         0.02364022    0.0039527524
## LenderYield                         0.02458979    0.0054184901
## EstimatedEffectiveYield             0.05568102    0.0439423192
## EstimatedLoss                       0.01774717    0.0007757060
## EstimatedReturn                     0.01739803   -0.0004439253
## DebtToIncomeRatio                   0.14170036    0.1477383684
## StatedMonthlyIncome                 0.26197311    0.2544062748
## TotalProsperLoans                   0.08806014    0.0705204703
## TotalProsperPaymentsBilled          0.06289968    0.0527256320
## CreditScoreRangeLower               0.04165550    0.0472189680
## CreditScoreRangeUpper               0.04165550    0.0472189680
## CurrentCreditLines                  1.00000000    0.9524770655
## OpenCreditLines                     0.95247707    1.0000000000
## TotalCreditLinespast7years          0.61617624    0.5628001179
## OpenRevolvingAccounts               0.83874246    0.8766942654
## OpenRevolvingMonthlyPayment         0.52962909    0.5498678426
## CurrentDelinquencies               -0.15011637   -0.1376370922
## AmountDelinquent                   -0.08299808   -0.0752859330
## DelinquenciesLast7Years            -0.18059603   -0.1785594837
##                             TotalCreditLinespast7years
## ProsperScore                               -0.13074084
## BorrowerAPR                                 0.10138943
## LenderYield                                 0.10105060
## EstimatedEffectiveYield                     0.10650004
## EstimatedLoss                               0.09572028
## EstimatedReturn                             0.06785002
## DebtToIncomeRatio                           0.07340176
## StatedMonthlyIncome                         0.23423035
## TotalProsperLoans                           0.13249436
## TotalProsperPaymentsBilled                  0.06588217
## CreditScoreRangeLower                      -0.01283341
## CreditScoreRangeUpper                      -0.01283341
## CurrentCreditLines                          0.61617624
## OpenCreditLines                             0.56280012
## TotalCreditLinespast7years                  1.00000000
## OpenRevolvingAccounts                       0.49021591
## OpenRevolvingMonthlyPayment                 0.32678090
## CurrentDelinquencies                        0.11001538
## AmountDelinquent                            0.05817124
## DelinquenciesLast7Years                     0.13835710
##                             OpenRevolvingAccounts
## ProsperScore                          0.036237188
## BorrowerAPR                          -0.042518049
## LenderYield                          -0.041290680
## EstimatedEffectiveYield              -0.003185943
## EstimatedLoss                        -0.047563091
## EstimatedReturn                      -0.023018436
## DebtToIncomeRatio                     0.111563761
## StatedMonthlyIncome                   0.174564346
## TotalProsperLoans                     0.036855177
## TotalProsperPaymentsBilled            0.016109632
## CreditScoreRangeLower                 0.064060292
## CreditScoreRangeUpper                 0.064060292
## CurrentCreditLines                    0.838742457
## OpenCreditLines                       0.876694265
## TotalCreditLinespast7years            0.490215905
## OpenRevolvingAccounts                 1.000000000
## OpenRevolvingMonthlyPayment           0.559699486
## CurrentDelinquencies                 -0.133712438
## AmountDelinquent                     -0.068939857
## DelinquenciesLast7Years              -0.170037309
##                             OpenRevolvingMonthlyPayment
## ProsperScore                              -0.0012851853
## BorrowerAPR                                0.0005081251
## LenderYield                               -0.0005642008
## EstimatedEffectiveYield                    0.0433791784
## EstimatedLoss                             -0.0194282400
## EstimatedReturn                            0.0228440506
## DebtToIncomeRatio                          0.1101714356
## StatedMonthlyIncome                        0.3670561492
## TotalProsperLoans                          0.0092175989
## TotalProsperPaymentsBilled                 0.0127022998
## CreditScoreRangeLower                      0.0134858976
## CreditScoreRangeUpper                      0.0134858976
## CurrentCreditLines                         0.5296290929
## OpenCreditLines                            0.5498678426
## TotalCreditLinespast7years                 0.3267808977
## OpenRevolvingAccounts                      0.5596994862
## OpenRevolvingMonthlyPayment                1.0000000000
## CurrentDelinquencies                      -0.1238420972
## AmountDelinquent                          -0.0510756543
## DelinquenciesLast7Years                   -0.1776844109
##                             CurrentDelinquencies AmountDelinquent
## ProsperScore                         -0.18480115     -0.073729852
## BorrowerAPR                           0.19993746      0.066158986
## LenderYield                           0.19517666      0.062125670
## EstimatedEffectiveYield               0.15020003      0.065092198
## EstimatedLoss                         0.21032901      0.061061195
## EstimatedReturn                       0.09806373      0.038420889
## DebtToIncomeRatio                    -0.03379112     -0.033351034
## StatedMonthlyIncome                  -0.02587082      0.026108752
## TotalProsperLoans                     0.00047803      0.001815828
## TotalProsperPaymentsBilled            0.05158468      0.025252482
## CreditScoreRangeLower                -0.20663505     -0.065088824
## CreditScoreRangeUpper                -0.20663505     -0.065088824
## CurrentCreditLines                   -0.15011637     -0.082998078
## OpenCreditLines                      -0.13763709     -0.075285933
## TotalCreditLinespast7years            0.11001538      0.058171238
## OpenRevolvingAccounts                -0.13371244     -0.068939857
## OpenRevolvingMonthlyPayment          -0.12384210     -0.051075654
## CurrentDelinquencies                  1.00000000      0.436999384
## AmountDelinquent                      0.43699938      1.000000000
## DelinquenciesLast7Years               0.42058704      0.309680266
##                             DelinquenciesLast7Years
## ProsperScore                            -0.14104529
## BorrowerAPR                              0.17097997
## LenderYield                              0.16653684
## EstimatedEffectiveYield                  0.13786487
## EstimatedLoss                            0.16626474
## EstimatedReturn                          0.10091213
## DebtToIncomeRatio                       -0.06632140
## StatedMonthlyIncome                     -0.03100907
## TotalProsperLoans                       -0.02753582
## TotalProsperPaymentsBilled               0.02293299
## CreditScoreRangeLower                   -0.20579584
## CreditScoreRangeUpper                   -0.20579584
## CurrentCreditLines                      -0.18059603
## OpenCreditLines                         -0.17855948
## TotalCreditLinespast7years               0.13835710
## OpenRevolvingAccounts                   -0.17003731
## OpenRevolvingMonthlyPayment             -0.17768441
## CurrentDelinquencies                     0.42058704
## AmountDelinquent                         0.30968027
## DelinquenciesLast7Years                  1.00000000

The correlation matrix shows that ProsperScore has a strong negative correlation with BorrowerAPR (-0.78), LenderYield (-0.76), EstimatedEffectiveYield (-0.72), and EstimatedLoss (-0.72), and a moderate negative correlation with EstimatedReturn (-0.53). BorrowerAPR and LenderYield are almost perfectly correlated with each other (0.99). ProsperScore has a moderate positive correlation (0.48) with CreditScoreRangeLower and CreditScoreRangeUpper (these two are perfectly correlated with each other). It has a weak correlation with DebtToIncomeRatio (-0.19) and StatedMonthlyIncome (0.18). We should look closely at the relationship between ProsperScore and the interest-related variables.

A quick preview of these correlations is plotted below.
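
A matrix like the one above can be produced with `cor()` on the numeric columns. The sketch below runs on a synthetic three-column data frame; both the data and the choice of `use = "pairwise.complete.obs"` for handling the NA values are assumptions, since the original call is not shown.

```r
# Synthetic stand-in: APR drives both yield and (negatively) the score
set.seed(42)
n <- 200
apr <- runif(n, 0.05, 0.40)
toy <- data.frame(
  BorrowerAPR  = apr,
  LenderYield  = apr - 0.01,
  ProsperScore = round(10 - 20 * apr + rnorm(n))
)
toy$ProsperScore[1:40] <- NA   # mimic the missing scores

# Pairwise-complete correlations (assumed NA handling)
cm <- cor(toy, use = "pairwise.complete.obs")
round(cm, 2)

# Rank the other variables by absolute correlation with ProsperScore
ps <- cm["ProsperScore", colnames(cm) != "ProsperScore"]
ps[order(abs(ps), decreasing = TRUE)]
```

Sorting one row by absolute value is a quick way to read a wide matrix: it lists the strongest relationships first regardless of sign.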

plot of chunk PS_v_Income1

The boxplots show that ProsperScore is negatively correlated with BorrowerAPR, LenderYield, EstimatedEffectiveYield, EstimatedLoss, and EstimatedReturn. This makes sense: lenders expect a higher return for taking on higher risk (lower ProsperScore).

plot of chunk PS_v_BorrowerAPR_scatter

Overall, BorrowerAPR decreases as ProsperScore increases. However, there are a few dense areas of BorrowerAPR around 0.36 for ProsperScore between 2 and 5. What are those loans, and why do they stay close to 0.36? I'd like to plot the histograms for each ProsperScore to highlight this pattern.

plot of chunk PS_v_BorrowerAPR_hist

The above plots confirm that quite a few loans with ProsperScore <= 5 have a BorrowerAPR of about 0.35. What are they? Also, for loans with ProsperScore 6-7, the BorrowerAPR distribution appears to be bimodal or multimodal, peaking at 0.2 and 0.3. I'm going to investigate those odd loans a little more.

## 
##  0.3504  0.3505 0.35088  0.3509 0.35097 0.35118 0.35132 0.35169  0.3522 
##       1       5       1      50      27       1     254       1       1 
## 0.35244  0.3525  0.3527 0.35285 0.35356 0.35365 0.35372 0.35394 0.35423 
##      69       1       1     214     656      11       1       1       7 
## 0.35495 0.35602 0.35617 0.35643 0.35644 0.35654 0.35699 0.35772 0.35797 
##       1       2       1    1531       1       7       2       1    3410 
## 0.35811 0.35819 0.35838  0.3584 0.35842 0.35843 0.35858  0.3586 0.35879 
##       1       1     117      15       1     101      66       1       1 
## 0.35898  0.3593 0.36075 0.36113 0.36145 0.36158 0.36178 0.36207 0.36235 
##       2       3       1       2       1       3       1       1       1 
## 0.36275  0.3628 0.36336 0.36346 0.36421 0.36428 0.36438 0.36461 0.36548 
##       1       4       1       1       1       1       9       1       1 
## 0.36564 0.36574 0.36578 0.36666 0.36676 0.36697 0.36716 0.36844 0.36845 
##       1       1       1       1       8       1       1       1       1 
## 0.36895 0.36929 0.36937 0.36945  0.3696 0.36983 
##       1       1       1       2       1       1

There are 3410 loans with BorrowerAPR at 0.35797 and 1531 loans at 0.35643.
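
A frequency table like the one above can be built by filtering the values to a window and tabulating them. The sketch below uses a handful of made-up APR values; the window of 0.35 to 0.37 is an assumption inferred from the range of values in the output.

```r
# Toy APR values (illustrative only)
apr <- c(0.35797, 0.35797, 0.35643, 0.35797, 0.35132,
         0.21000, 0.35643, 0.12000, NA)

# Keep values inside the window of interest, dropping NA
near <- apr[!is.na(apr) & apr >= 0.35 & apr < 0.37]

# Count each distinct value and list the most frequent first
tab <- sort(table(near), decreasing = TRUE)
head(tab, 2)
```

Sorting by frequency puts the dominant APR values at the top, which is easier to scan than a table ordered by value.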

plot of chunk PS_v_BorrowerAPR_oddity2

It appears that these loans mainly fall within 2011-2013. Did something happen after the beginning of 2013 that led to the reduction?

plot of chunk PS_v_EstimatedReturn_scatter

The relationship between ProsperScore and EstimatedReturn appears to be similar to the one between ProsperScore and BorrowerAPR, i.e. negative correlation.

plot of chunk PS_v_IncomeRange

Borrowers with an IncomeRange of $50K or less (including “Not employed”) have a median ProsperScore of 5. The $50K-$100K range has a median of 6, and the range over $100K has a median of 7.

plot of chunk PS_v_StatedMonthlyIncome

The boxplot shows that the median StatedMonthlyIncome tends to be higher as the ProsperScore increases. There is one oddity: ProsperScore 1 has a higher income than score 2 and even score 5. This could be because some people lied about their income, which led to a low credit rating and thus a low ProsperScore. The median StatedMonthlyIncome appears to grow roughly exponentially with ProsperScore.

plot of chunk PS_v_StatedMonthlyIncome2

The above plot shows that the median StatedMonthlyIncome increases with ProsperScore.
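
The per-score medians behind such a plot can be computed with `aggregate()`. This is a sketch on a toy data frame with made-up incomes, not the actual Prosper values.

```r
# Toy data: a few incomes per score, including one missing score
toy <- data.frame(
  ProsperScore        = c(1, 1, 2, 2, 3, 3, NA),
  StatedMonthlyIncome = c(3000, 5000, 2500, 3500, 4000, 6000, 4500)
)

# Median income per ProsperScore; the formula interface drops rows
# with NA in either variable by default
med <- aggregate(StatedMonthlyIncome ~ ProsperScore, data = toy, FUN = median)
med
```

In this toy data the score-1 median is higher than the score-2 median, which mirrors the kind of oddity discussed above.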

plot of chunk PS_v_StatedMonthlyIncome3

The above plot is the same as the previous one, but split by IncomeVerifiable. The oddity at ProsperScore 1 still exists even in the IncomeVerifiable = TRUE subset. This means other factors contribute to this oddity.

plot of chunk PS_v_DebtToIncomeRatio

It makes sense that DebtToIncomeRatio is negatively correlated with ProsperScore.

plot of chunk PS_v_CreditHistory

plot of chunk PS_v_CreditHistory2

It seems that TotalProsperPaymentsBilled, CreditScoreRangeLower, and CreditScoreRangeUpper are positively correlated with ProsperScore, while AmountDelinquent and DelinquenciesLast7Years are negatively correlated with ProsperScore.

LoanOriginalAmount

plot of chunk PS_v_LoanOriginalAmount_box

In general, the median LoanOriginalAmount increases with ProsperScore. This makes sense, as lenders are likely to be cautious about lending larger amounts at higher risk (i.e. low ProsperScore). The NA values of ProsperScore belong to loans before 2009, which raises the question: why do those loans have a much lower LoanOriginalAmount, with a median around 4000?

plot of chunk BorrowerAPR_v_LoanOriginalAmount_scatter

As LoanOriginalAmount increases, the range of BorrowerAPR narrows from 0.05-0.40 at $1K to 0.10-0.20 at $35K.

plot of chunk cut_LoanOriginalAmount_trans

The above faceted histograms confirm this pattern. This indicates that there are fewer APR options when quoting larger loans.
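
Faceting by loan size requires turning LoanOriginalAmount into categories first, which `cut()` does. The sketch below uses made-up amounts and APRs; the bin edges and labels are hypothetical, not the ones used for the plot.

```r
# Toy loans (illustrative values only)
amount <- c(1000, 4000, 7500, 12000, 25000, 35000)
apr    <- c(0.35, 0.08, 0.22, 0.15, 0.12, 0.18)

# Hypothetical bin edges; intervals are closed on the right by default
bucket <- cut(amount,
              breaks = c(0, 5000, 10000, 20000, 35000),
              labels = c("<=5K", "5K-10K", "10K-20K", "20K-35K"))

# Spread of APR (max minus min) within each bucket
spread <- tapply(apr, bucket, function(v) diff(range(v)))
spread
```

Comparing the spread per bucket gives a numeric counterpart to the visual narrowing of the APR range for larger loans.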

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

As described in the Bivariate Plots Section, ProsperScore has a strong negative correlation with BorrowerAPR, LenderYield, EstimatedEffectiveYield, and EstimatedLoss, and a moderate negative correlation with EstimatedReturn. It has a moderate positive correlation with CreditScoreRangeLower and CreditScoreRangeUpper (these two are perfectly correlated with each other). It has a weak correlation with DebtToIncomeRatio (negative) and StatedMonthlyIncome (positive).

Also, as LoanOriginalAmount increases, the range of BorrowerAPR narrows, which means the options for larger loans are limited.

Investigating the distributions of BorrowerAPR by ProsperScore shows that quite a lot of loans have a BorrowerAPR of 0.35797 or 0.35643 within the ProsperScore range of 2-5. A further check shows that these loans were mainly created in 2011-2012.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

CreditScoreRangeLower and CreditScoreRangeUpper are perfectly correlated with each other. BorrowerAPR and LenderYield are almost perfectly correlated with each other (0.99).

What was the strongest relationship you found?

ProsperScore is most strongly correlated with the interest-related variables, particularly BorrowerAPR. It is also correlated with CreditScoreRangeLower/CreditScoreRangeUpper, but less strongly than with BorrowerAPR, LenderYield, EstimatedEffectiveYield, EstimatedLoss, and EstimatedReturn.

Multivariate Plots Section

plot of chunk PS_v_BorrowerAPR_IncomeRange

The scatter plot confirms that BorrowerAPR tends to be lower as ProsperScore increases. However, there seems to be no obvious correlation between StatedMonthlyIncome and ProsperScore.

plot of chunk PS_v_BorrowerAPR_CreditScoreRangeLower

plot of chunk PS_v_BorrowerAPR_CreditScoreRangeLower2

It seems that CreditScoreRangeLower has a negative linear correlation (though a somewhat weak one) with BorrowerAPR. Also, when CreditScoreRangeLower is small, BorrowerAPR falls within a narrower range with a higher median, and low ProsperScore values are more common.

plot of chunk PS_v_EstimatedEffectiveYield

The scatter of BorrowerAPR against EstimatedEffectiveYield clearly consists of multiple straight lines. From top left to bottom right, ProsperScore tends to decrease, but it seems to be bimodal or multimodal for the lower scores (1-5).

plot of chunk PS_v_LenderYield

Again, the distribution noticeably consists of multiple linear relationships, but the separation is less obvious.

plot of chunk interest_var_scatter_matrix

The scatter matrix above shows the correlations between the interest variables. BorrowerAPR and LenderYield have a strong linear relationship.

## 
## Calls:
## m1: lm(formula = ProsperScore ~ BorrowerAPR, data = df)
## m2: lm(formula = ProsperScore ~ BorrowerAPR + DebtToIncomeRatio + 
##     log(StatedMonthlyIncome), data = df)
## m3: lm(formula = ProsperScore ~ BorrowerAPR + DebtToIncomeRatio + 
##     log(StatedMonthlyIncome) + CreditScoreRangeLower, data = df)
## m4: lm(formula = ProsperScore ~ BorrowerAPR + DebtToIncomeRatio + 
##     log(StatedMonthlyIncome) + CreditScoreRangeLower + OpenRevolvingAccounts + 
##     CurrentDelinquencies + AmountDelinquent + DelinquenciesLast7Years, 
##     data = df)
## m5: lm(formula = ProsperScore ~ BorrowerAPR + DebtToIncomeRatio + 
##     log(StatedMonthlyIncome) + CreditScoreRangeLower + OpenRevolvingAccounts + 
##     CurrentDelinquencies + AmountDelinquent + DelinquenciesLast7Years + 
##     EstimatedEffectiveYield + EstimatedLoss + EstimatedReturn, 
##     data = df)
## 
## =====================================================================================
##                               m1          m2          m3          m4          m5     
## -------------------------------------------------------------------------------------
## (Intercept)                10.454***    9.294***    8.191***    7.808***   10.150*** 
##                            (0.018)     (0.110)     (0.162)     (0.163)     (0.153)   
## BorrowerAPR               -19.873***  -19.677***  -19.195***  -19.324***  -56.296*** 
##                            (0.076)     (0.082)     (0.096)     (0.097)     (0.649)   
## DebtToIncomeRatio                      -0.326***   -0.342***   -0.220***   -0.095*** 
##                                        (0.022)     (0.022)     (0.022)     (0.020)   
## log(StatedMonthlyIncome)                0.147***    0.143***    0.217***    0.217*** 
##                                        (0.012)     (0.012)     (0.012)     (0.011)   
## CreditScoreRangeLower                               0.001***    0.001***   -0.002*** 
##                                                    (0.000)     (0.000)     (0.000)   
## OpenRevolvingAccounts                                          -0.035***   -0.022*** 
##                                                                (0.001)     (0.001)   
## CurrentDelinquencies                                            0.003       0.006    
##                                                                (0.006)     (0.006)   
## AmountDelinquent                                               -0.000**    -0.000    
##                                                                (0.000)     (0.000)   
## DelinquenciesLast7Years                                        -0.002*      0.002*   
##                                                                (0.001)     (0.001)   
## EstimatedEffectiveYield                                                   -11.515*** 
##                                                                            (0.206)   
## EstimatedLoss                                                              43.947*** 
##                                                                            (0.783)   
## EstimatedReturn                                                            67.401*** 
##                                                                            (0.611)   
## -------------------------------------------------------------------------------------
## R-squared                       0.447       0.459       0.460       0.464       0.563
## adj. R-squared                  0.447       0.459       0.460       0.464       0.563
## sigma                           1.768       1.737       1.736       1.729       1.561
## F                           68477.863   21976.935   16522.804    8395.500    9097.860
## p                               0.000       0.000       0.000       0.000       0.000
## Log-likelihood            -168748.663 -152861.786 -152818.227 -152528.761 -144581.194
## Deviance                   265198.520  233938.238  233675.608  231937.806  188957.385
## AIC                        337503.326  305733.571  305648.453  305077.521  289188.389
## BIC                        337531.372  305779.865  305704.006  305170.109  289308.753
## N                           84853       77557       77557       77557       77557    
## =====================================================================================

The model performs poorly: it predicts the ProsperScore exactly for only about 17% of loans (12,994 of 77,557 in the tabulation below).

## 
##     0     1 
## 64563 12994
##    ProsperScore ProsperScore.pred
## 1            NA                 7
## 2             7                 7
## 3            NA                NA
## 4             9                 8
## 5             4                 5
## 6            10                 7
## 7             2                 4
## 8             4                 5
## 9             9                 8
## 10           11                 8
## 11            7                 5
## 12           NA                 7
## 13            4                 6
## 14            8                 7
## 15            8                 7
## 16            5                 3
## 17            4                 4
## 18           NA                NA
## 19            7                 8
## 20            8                 5
## 21            7                 7
## 22           NA                 6
## 23            2                 0
## 24            5                 4
## 25            5                 6
## 26            3                 4
## 27            3                 4
## 28            9                 8
## 29            4                 6
## 30            6                 7
## 31            9                 7
## 32            5                 3
## 33            8                 7
## 34           10                 9
## 35            5                 5
## 36            8                 6
## 37            2                 4
## 38            6                 4
## 39            9                 8
## 40           NA                NA
## 41            4                NA
## 42            8                 6
## 43           NA                NA
## 44            6                 6
## 45            5                 6
## 46            7                 7
## 47           NA                 6
## 48            8                 7
## 49            6                 7
## 50           10                 8

plot of chunk modelling2

The model underperforms mainly because it fails to predict loans with a ProsperScore above 9 or equal to 1 correctly.
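
The hit rate quoted above, and the per-score breakdown behind this diagnosis, can be sketched as follows. This is a minimal illustration rather than the original chunk: the nine rows are copied from the printed ProsperScore / ProsperScore.pred output (rows 2 and 4 to 11), and the column name `hit` is an assumption introduced here.

```r
# Rows 2 and 4-11 of the printed comparison above (rows with NA left out)
scores <- data.frame(
  ProsperScore      = c(7, 9, 4, 10, 2, 4, 9, 11, 7),
  ProsperScore.pred = c(7, 8, 5,  7, 4, 5, 8,  8, 5)
)

# 1 when the prediction equals the actual score, 0 otherwise
# ('hit' is a hypothetical column name used only in this sketch)
scores$hit <- as.integer(scores$ProsperScore == scores$ProsperScore.pred)

# Exact-match rate on this small sample
overall_rate <- mean(scores$hit, na.rm = TRUE)

# Exact-match rate per actual score, to see which scores the model misses
rate_by_score <- tapply(scores$hit, scores$ProsperScore, mean, na.rm = TRUE)

# Exact-match rate on the full data, from the 0/1 counts printed above
full_rate <- 12994 / (64563 + 12994)

print(overall_rate)
print(rate_by_score)
print(round(full_rate, 3))
```

On the full data frame the same `tapply` call would give one hit rate per ProsperScore level, which is a direct way to check where the misses concentrate.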

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

I plotted StatedMonthlyIncome against BorrowerAPR split by ProsperScore, but the individual score levels made the plot too crowded, so I grouped ProsperScore into 4 buckets. The plot confirmed that ProsperScore falls as BorrowerAPR rises, which means that lower-risk loans have a higher ProsperScore. There is only a very weak negative correlation between StatedMonthlyIncome and BorrowerAPR. A second plot, of CreditScoreRangeLower against BorrowerAPR, shows that BorrowerAPR tends to be higher when CreditScoreRangeLower is lower, and that a higher ProsperScore goes with a higher credit score.
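
Grouping a score into a few buckets like this can be done with base R's `cut()`. The sketch below is only an illustration: the report does not state the bucket boundaries, so the break points and labels are hypothetical, and `prosper_score` is a toy vector standing in for `df$ProsperScore`.

```r
# Toy scores; in the analysis this would be df$ProsperScore
prosper_score <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, NA)

# Hypothetical break points giving the intervals (0,3], (3,6], (6,9], (9,11];
# the boundaries actually used in the report are not stated
score_bucket <- cut(
  prosper_score,
  breaks = c(0, 3, 6, 9, 11),
  labels = c("1-3", "4-6", "7-9", "10-11")
)

# Count loans per bucket, keeping missing scores visible
print(table(score_bucket, useNA = "ifany"))
```

The result is a factor, so it can be used directly as a colour or facet variable in a plot.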

Looking at BorrowerAPR, LenderYield, EstimatedEffectiveYield, EstimatedLoss and EstimatedReturn, I also found that BorrowerAPR and LenderYield are tightly correlated, with a Pearson correlation near 1. It is also interesting that EstimatedEffectiveYield is always higher than EstimatedReturn. EstimatedLoss and EstimatedEffectiveYield appear to be linearly related.
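
A Pearson correlation of this kind can be computed with `cor()`. The sketch below uses synthetic data, not the Prosper dataset: `toy` is a made-up data frame in which LenderYield is modelled as BorrowerAPR minus a roughly constant fee, which is purely an illustrative assumption.

```r
set.seed(42)

# Synthetic stand-in for the loan data (NOT the real Prosper values)
toy <- data.frame(BorrowerAPR = runif(200, min = 0.06, max = 0.40))

# Assumed relationship for illustration: yield = APR - fee + small noise
toy$LenderYield <- toy$BorrowerAPR - 0.03 + rnorm(200, sd = 0.005)

# Mimic missing values, which the real columns also contain
toy$LenderYield[c(5, 50)] <- NA

# Pearson correlation, dropping rows where either value is missing
r <- cor(toy$BorrowerAPR, toy$LenderYield,
         use = "complete.obs", method = "pearson")
print(r)
```

With the real data, the same call on `df$BorrowerAPR` and `df$LenderYield` would reproduce the figure discussed above.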

Were there any interesting or surprising interactions between features?

Interestingly, income (StatedMonthlyIncome) has only a very weak effect on BorrowerAPR. This is probably because income by itself is not a strong indicator of creditworthiness from Prosper's point of view.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

Yes, I created a linear model with the dataset. Its main strength is that it highlights which factors contribute to the ProsperScore. For example, in model m4, every 0.05 increase in BorrowerAPR lowers the ProsperScore by about 1 (= 0.05 * -19.324). However, the R-squared of this model is small (~0.464), which means that more than half of the variance is unexplained. Even the largest model, m5, reaches only about 0.563.
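
The arithmetic behind this interpretation is spelled out below. The two numbers are copied from the m4 column of the regression table above (the column in which the coefficient -19.324 and the R-squared 0.464 appear); the variable names are introduced here only for illustration.

```r
# Values copied from the m4 column of the regression table above
coef_apr_m4  <- -19.324   # BorrowerAPR coefficient
r_squared_m4 <- 0.464     # R-squared

# Expected change in ProsperScore for a 0.05 rise in BorrowerAPR,
# holding the other predictors fixed
delta_score <- 0.05 * coef_apr_m4

# Share of ProsperScore variance that the model leaves unexplained
unexplained <- 1 - r_squared_m4

print(delta_score)
print(unexplained)
```

So a loan whose APR is 5 percentage points higher is expected to score roughly one point lower, while a little over half of the variation in scores is not captured by these predictors.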


Final Plots and Summary

Plot One

plot of chunk Plot_One

Description One

I chose this plot because it shows how the distribution of Borrower APR changes across Prosper Scores.

This faceted histogram shows a clear negative relationship between Borrower APR and Prosper Score: APR is generally lower at higher Prosper Scores, and the peak of the APR distribution shifts from right to left as Prosper Score increases. However, it is surprising that some loans have an oddly high APR near 36% even though their Prosper Score is in the lower range (2-5).

Plot Two

plot of chunk Plot_Two

Description Two

I chose this boxplot because it shows a positive relationship between monthly income and Prosper Score, although the correlation is weak (only around 0.084). Monthly income is generally higher at higher Prosper Scores. The one exception is a Prosper Score of 1, where monthly income is slightly higher than for scores of 2-5, possibly because of overstated or unverifiable income.

Plot Three

plot of chunk Plot_Three

Description Three

I chose this time series plot because it tells a story about the Great Recession of 2008-2012.

The number of loans originated increased from 2006 and then fell to 0 at the end of 2008, as a result of the 2008 Global Financial Crisis. From the end of 2009, loans started to recover, climbing to about 100 per day in the third quarter of 2012 before dropping to about 50 due to the 2012 Financial Crisis. From 2013, loans increased again, jumping from about 50 per day to roughly four times that over the next 12 months.


Reflection

The biggest challenge in this project was variable selection. The dataset has many variables (81), and it was hard to decide which to begin with. Fortunately, following the example project, I was able to create a correlation matrix, which eased the selection. I soon realised that ProsperScore is highly correlated with interest-related variables such as BorrowerAPR and LenderYield. Before any analysis, I profiled the data, looking at the types and structure of the columns. The univariate exploration gave me a general idea of the distribution of each variable of interest and revealed a large number of loans with a BorrowerAPR of around 0.36. The time series plot of LoanOriginationDate also let me tell the story of the financial crisis's impact. In the Bivariate Plots section, I confirmed the negative correlation between BorrowerAPR and ProsperScore using the jittered scatterplot and boxplot. I still struggled to understand how ProsperScore can be predicted from properly selected variables. Further work could focus on analysing the association of each variable with ProsperScore quantitatively, in order to predict the score more accurately.